Suicide Rates Data Exploration

by Shahad Aljurbua

Preliminary Wrangling

This document explores a dataset of suicide rates by country, covering 1985 to 2016.

In [2]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px

%matplotlib inline
In [3]:
# load the dataset into a pandas DataFrame
df = pd.read_csv('master.csv')
In [4]:
df.head()
Out[4]:
country year sex age suicides_no population suicides/100k pop country-year HDI for year gdp_for_year ($) gdp_per_capita ($) generation
0 Albania 1987 male 15-24 years 21 312900 6.71 Albania1987 NaN 2,156,624,900 796 Generation X
1 Albania 1987 male 35-54 years 16 308000 5.19 Albania1987 NaN 2,156,624,900 796 Silent
2 Albania 1987 female 15-24 years 14 289700 4.83 Albania1987 NaN 2,156,624,900 796 Generation X
3 Albania 1987 male 75+ years 1 21800 4.59 Albania1987 NaN 2,156,624,900 796 G.I. Generation
4 Albania 1987 male 25-34 years 9 274300 3.28 Albania1987 NaN 2,156,624,900 796 Boomers
In [5]:
# data overview 
print(df.shape)
print(df.dtypes)
(27820, 12)
country                object
year                    int64
sex                    object
age                    object
suicides_no             int64
population              int64
suicides/100k pop     float64
country-year           object
HDI for year          float64
 gdp_for_year ($)      object
gdp_per_capita ($)      int64
generation             object
dtype: object
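
The dtypes listing shows two things worth handling before any GDP analysis: the column name ` gdp_for_year ($)` carries a leading space, and the column is stored as `object` because its values contain thousands separators (for example `2,156,624,900`). Below is a minimal sketch of one way to clean both. It uses a tiny hand-built frame with values copied from the outputs in this notebook rather than the real file, and the name `raw` is only illustrative.

```python
import pandas as pd

# Tiny stand-in for the real data; GDP strings copied from this notebook's outputs.
raw = pd.DataFrame({
    'country': ['Albania', 'Austria'],
    ' gdp_for_year ($)': ['2,156,624,900', '261,695,778,781'],
})

# Strip stray whitespace from every column name.
raw.columns = raw.columns.str.strip()

# Remove the thousands separators, then convert the strings to integers.
raw['gdp_for_year ($)'] = (
    raw['gdp_for_year ($)'].str.replace(',', '', regex=False).astype('int64')
)

print(raw.dtypes)
```

When reading the real file, passing `thousands=','` to `pd.read_csv` may be a simpler alternative, though I have not checked it against `master.csv`.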
In [6]:
df.describe()
Out[6]:
year suicides_no population suicides/100k pop HDI for year gdp_per_capita ($)
count 27820.000000 27820.000000 2.782000e+04 27820.000000 8364.000000 27820.000000
mean 2001.258375 242.574407 1.844794e+06 12.816097 0.776601 16866.464414
std 8.469055 902.047917 3.911779e+06 18.961511 0.093367 18887.576472
min 1985.000000 0.000000 2.780000e+02 0.000000 0.483000 251.000000
25% 1995.000000 3.000000 9.749850e+04 0.920000 0.713000 3447.000000
50% 2002.000000 25.000000 4.301500e+05 5.990000 0.779000 9372.000000
75% 2008.000000 131.000000 1.486143e+06 16.620000 0.855000 24874.000000
max 2016.000000 22338.000000 4.380521e+07 224.970000 0.944000 126352.000000
In [7]:
df.isnull().sum()
Out[7]:
country                   0
year                      0
sex                       0
age                       0
suicides_no               0
population                0
suicides/100k pop         0
country-year              0
HDI for year          19456
 gdp_for_year ($)         0
gdp_per_capita ($)        0
generation                0
dtype: int64
In [8]:
# drop 'HDI for year', which is missing in most rows
df.drop(columns='HDI for year', inplace=True)
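
Dropping the column rests on how much of it is missing: 19,456 of 27,820 values, roughly 70%. The sketch below shows one way to compute that share per column and drop anything above a cutoff. The 50% cutoff, the toy frame and the HDI value in it are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: three of the four HDI values are missing.
toy = pd.DataFrame({
    'year': [1987, 1988, 1989, 1990],
    'HDI for year': [np.nan, np.nan, np.nan, 0.62],
})

# Share of missing values in each column (0.0 to 1.0).
missing_share = toy.isnull().mean()

# Assumed rule: drop any column that is more than half empty.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)

print(missing_share)
print(cleaned.columns.tolist())
```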
In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   country             27820 non-null  object 
 1   year                27820 non-null  int64  
 2   sex                 27820 non-null  object 
 3   age                 27820 non-null  object 
 4   suicides_no         27820 non-null  int64  
 5   population          27820 non-null  int64  
 6   suicides/100k pop   27820 non-null  float64
 7   country-year        27820 non-null  object 
 8    gdp_for_year ($)   27820 non-null  object 
 9   gdp_per_capita ($)  27820 non-null  int64  
 10  generation          27820 non-null  object 
dtypes: float64(1), int64(4), object(6)
memory usage: 2.3+ MB
In [10]:
df.sample(10)
Out[10]:
country year sex age suicides_no population suicides/100k pop country-year gdp_for_year ($) gdp_per_capita ($) generation
2005 Austria 2003 male 25-34 years 118 569434 20.72 Austria2003 261,695,778,781 33889 Generation X
17052 Montenegro 2008 female 15-24 years 1 46646 2.14 Montenegro2008 4,545,674,528 7705 Millenials
2214 Azerbaijan 1993 female 5-14 years 1 759300 0.13 Azerbaijan1993 3,973,027,397 617 Millenials
13933 Kazakhstan 2006 female 75+ years 53 282507 18.76 Kazakhstan2006 81,003,884,545 5770 Silent
11324 Guyana 2010 female 35-54 years 8 86183 9.28 Guyana2010 2,273,225,042 3384 Generation X
8142 El Salvador 1997 male 55-74 years 47 245832 19.12 El Salvador1997 10,221,705,900 2065 Silent
6211 Costa Rica 2001 female 55-74 years 0 200525 0.00 Costa Rica2001 15,913,363,335 4412 Silent
11523 Hungary 2004 female 75+ years 146 450461 32.41 Hungary2004 104,066,609,518 10806 Silent
2007 Austria 2003 male 15-24 years 91 501938 18.13 Austria2003 261,695,778,781 33889 Millenials
26737 United Kingdom 2006 female 15-24 years 74 3911869 1.89 United Kingdom2006 2,692,612,695,492 47163 Millenials
In [11]:
# numeric_only=True keeps the string columns out of the yearly sums
yearlySum = df.groupby('year').sum(numeric_only=True)
yearlySum
Out[11]:
suicides_no population suicides/100k pop gdp_per_capita ($)
year
1985 116063 1008600086 6811.89 3508548
1986 120670 1029909613 6579.84 4104636
1987 126842 1095029726 7545.45 5645760
1988 121026 1054094424 7473.13 5870508
1989 160244 1225514347 8036.54 6068424
1990 193361 1466620100 9878.75 7531260
1991 198020 1489988384 10321.06 7782096
1992 211473 1569539447 10528.88 8195232
1993 221565 1530416654 10790.29 8231796
1994 232063 1548749372 11483.79 9438756
1995 243544 1591559103 14660.26 11858508
1996 246725 1662267662 14142.21 11600736
1997 240745 1702991519 13817.83 11398596
1998 249591 1725181351 14150.72 11506728
1999 256119 1776363155 14473.91 12780864
2000 255832 1799227908 14387.45 12865476
2001 250652 1755565489 14276.21 12677892
2002 256095 1822152815 14227.72 13017420
2003 256079 1838458020 13627.58 15187104
2004 240861 1745246613 12581.80 17895936
2005 234375 1734909645 12164.99 20317212
2006 233361 1840908837 12166.01 21563784
2007 233408 1859564353 12410.15 24709620
2008 235447 1860620851 12145.84 26936208
2009 243487 1976228366 12176.04 24145248
2010 238702 1997297329 11843.99 25193196
2011 236484 1993362332 11367.84 26936148
2012 230160 1912812088 11101.91 26058300
2013 223199 1890161710 10663.64 26911368
2014 222984 1912057309 10306.73 25665252
2015 203640 1774657932 8253.99 19516008
2016 15603 132101896 2147.39 4106420
In [12]:
# remove 2016 because its data is incomplete
df = df[df.year != 2016]
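
The 2016 totals in the yearly table are far below those of the neighbouring years, which suggests that only some countries reported that year. One hedged way to check this directly is to count the distinct countries per year. The toy frame below is made up, and both the name `countries_per_year` and the "less than half the median" rule are assumptions for illustration.

```python
import pandas as pd

# Made-up data: three countries report in 2014 and 2015, only one in 2016.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C', 'A'],
    'year':    [2014, 2014, 2014, 2015, 2015, 2015, 2016],
})

# Number of distinct countries reporting in each year.
countries_per_year = toy.groupby('year')['country'].nunique()

# Assumed rule: a year is sparse if it has fewer than half the median count.
cutoff = 0.5 * countries_per_year.median()
sparse_years = countries_per_year[countries_per_year < cutoff].index.tolist()

# Keep only the years that are not flagged as sparse.
filtered = toy[~toy['year'].isin(sparse_years)]
print(sparse_years)
```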
In [13]:
df.hist(figsize=(10,8));
In [14]:
sb.pairplot(df)
Out[14]:
<seaborn.axisgrid.PairGrid at 0x1a207cab50>

Dataset Overview

This document explores a dataset of suicide rates from 1985 to 2016. The dataset has 27,820 rows, one for each combination of country, year, sex and age group. Its variables cover the main data on suicides: the suicide count, country, population, year, sex and age group, as well as the suicide rate per 100k population. The dataset contains both numeric and categorical variables.

Investigation Overview

I am interested in exploring the factors correlated with increased suicide rates around the world.

Univariate Exploration

In [15]:
fig, ax = plt.subplots(nrows=3, figsize = [8,8])

variables = ['suicides/100k pop', 'age', 'sex']
for i in range(len(variables)):
    var = variables[i]
    ax[i].hist(data = df, x = var)
plt.show()    
In [16]:
df['sex'].value_counts()
Out[16]:
female    13830
male      13830
Name: sex, dtype: int64
In [17]:
df['age'].value_counts()
Out[17]:
55-74 years    4610
25-34 years    4610
15-24 years    4610
75+ years      4610
35-54 years    4610
5-14 years     4610
Name: age, dtype: int64

It seems that every age group and every sex has the same number of rows, so row counts alone say little. I will therefore explore the total number of suicides for each variable.

Let's investigate how total suicides are distributed across the variables.

Global suicide totals by age

In [18]:
age = df.loc[:,['age','suicides_no']]
age['suicides_sum'] = age.groupby(['age'])['suicides_no'].transform('sum')
age.drop('suicides_no', axis=1, inplace=True)
age = age.drop_duplicates()
fig=px.bar(age,x='age', y='suicides_sum', title='Suicide Totals By Age',
           category_orders={"age":['5-14 years', '15-24 years', '25-34 years', '35-54 years', '55-74 years','75+ years']})
fig.show()     
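
The cell above builds the totals with `transform('sum')` followed by `drop_duplicates()`. A shorter route that should give the same table is a plain `groupby(...).sum()`. The sketch below checks the equivalence on the five Albania 1987 rows shown in the `df.head()` output. The names `a`, `b` and `same` are illustrative only.

```python
import pandas as pd

# Age group and suicide count for the five rows shown by df.head().
rows = pd.DataFrame({
    'age': ['15-24 years', '35-54 years', '15-24 years', '75+ years', '25-34 years'],
    'suicides_no': [21, 16, 14, 1, 9],
})

# Pattern used in this notebook: broadcast the group sums, then de-duplicate.
a = rows.loc[:, ['age', 'suicides_no']].copy()
a['suicides_sum'] = a.groupby('age')['suicides_no'].transform('sum')
a = a.drop(columns='suicides_no').drop_duplicates()

# Shorter alternative: aggregate directly to one row per group.
b = rows.groupby('age', as_index=False)['suicides_no'].sum()
b = b.rename(columns={'suicides_no': 'suicides_sum'})

# Compare the two tables once both are sorted by age.
same = (a.sort_values('age').reset_index(drop=True)
         .equals(b.sort_values('age').reset_index(drop=True)))
print(same)
```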

Suicide totals are highest among middle-aged individuals. But is that still the case if we consider the rate instead?

Global suicide totals by sex

In [19]:
sex = df.loc[:,['sex','suicides_no']]
sex['suicides_sum'] = sex.groupby(['sex'])['suicides_no'].transform('sum')
sex.drop('suicides_no', axis=1, inplace=True)
sex = sex.drop_duplicates()
fig=px.pie(sex,names='sex', values="suicides_sum", title= 'Suicide Totals By Sex')
fig.show()

Males account for about three quarters of all suicides across both sexes.

Global suicide totals over the years 1985-2015

In [20]:
year = df.loc[:,['year','suicides_no']]
year['suicides_sum'] = year.groupby(['year'])['suicides_no'].transform('sum')
year.drop('suicides_no', axis=1, inplace=True)
year = year.drop_duplicates()
fig=px.bar(year,x='year', y="suicides_sum", title= 'Suicide Totals Over The Years')
fig.show()   

Total suicides per year peaked in 1999, at 256,119.

Global suicide totals by country

In [21]:
country = df.loc[:,['country','suicides_no']]
country = country.groupby('country')['suicides_no'].sum().reset_index()
country = country.sort_values('suicides_no')
country = country.tail(20)
fig = px.bar(country, x='suicides_no', y='country', title= 'Suicide Totals By country')
fig.show()

As the plot shows, Russia has the highest number of suicides.

Bivariate Exploration

In this section, I will investigate the relationship between the suicide rate and the other variables.

Global Suicide Trend Over The Years 1985-2015

Let's first explore the suicide trend over the years.

In [22]:
plt.figure(figsize=(8,6))
plt.title('Global Suicide Trend Over Years', fontsize=16)
ys=sb.lineplot(data=df, x='year', y='suicides/100k pop')
ys.set(xlabel='year', ylabel='Suicides per 100k ');
  • The suicide rate reached its peak in 1995, at 15.5 per 100k.
  • After 1995, the rate decreased slightly.
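
The line plot averages the per-row `suicides/100k pop` values, so every country, sex and age group counts equally regardless of its population. An alternative, sketched here as an assumption rather than the method used in this notebook, is a population-weighted rate: total suicides divided by total population, times 100,000. The figures are copied from the yearly sums table earlier in this notebook, and the helper `rate_per_100k` is a hypothetical name.

```python
# (suicides_no, population) for selected years, copied from the yearly sums table.
yearly_totals = {
    1985: (116063, 1008600086),
    1995: (243544, 1591559103),
    2015: (203640, 1774657932),
}

def rate_per_100k(suicides, population):
    """Population-weighted suicide rate per 100,000 people."""
    return suicides / population * 100_000

# Print the pooled rate for each selected year.
for year, (suicides, population) in yearly_totals.items():
    print(year, round(rate_per_100k(suicides, population), 2))
```

On this measure 1995 comes out at about 15.3 per 100k, close to the value read off the plot, while 1985 and 2015 both sit near 11.5.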

Now let's consider the other variables

Global Suicide Rate By Age

In [23]:
plt.figure(figsize=(8,6))
plt.title('Global Suicide Rate By Age', fontsize=16)
base_color = sb.color_palette()[0]
age_order = ['5-14 years', '15-24 years', '25-34 years', '35-54 years', '55-74 years','75+ years']
chart=sb.barplot(data = df ,x = 'age',y = 'suicides/100k pop', ci = None, color = base_color, order= age_order)
chart.set_xticklabels(chart.get_xticklabels(), rotation=45);
  • Unlike total suicides, the suicide rate is highest among the oldest individuals.
  • The suicide rate tends to increase with age.
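
Totals and rates can rank groups differently because the groups differ in size. The sketch below illustrates this with the five Albania 1987 rows from the `df.head()` output: it pools suicides and population per age group and derives the rate from the pooled counts. Five rows prove nothing about the global pattern; the names `rows` and `by_age` are illustrative.

```python
import pandas as pd

# The five Albania 1987 rows shown in the df.head() output.
rows = pd.DataFrame({
    'sex': ['male', 'male', 'female', 'male', 'male'],
    'age': ['15-24 years', '35-54 years', '15-24 years', '75+ years', '25-34 years'],
    'suicides_no': [21, 16, 14, 1, 9],
    'population': [312900, 308000, 289700, 21800, 274300],
})

# Pool counts per age group.
by_age = rows.groupby('age')[['suicides_no', 'population']].sum()

# Rate from pooled counts: total suicides over total population, per 100k.
by_age['rate_per_100k'] = by_age['suicides_no'] / by_age['population'] * 100_000

print(by_age.sort_values('rate_per_100k', ascending=False))
```

In these rows the 75+ group has the smallest total (one suicide) yet a higher rate than the 25-34 group, because its population is far smaller.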

Global Suicide Rate By Sex

In [24]:
plt.figure(figsize=(8,6))
plt.title('Global Suicide Rate By Sex', fontsize=16)
sx =sb.barplot(data = df ,x = 'sex',y = 'suicides/100k pop' ,ci = None)
sx.set(xlabel='Sex', ylabel='Suicides per 100k');
  • The male suicide rate is about three times the female suicide rate.

Global Suicide Rate By Country

To get the suicide rate for each country, I will first take the mean of the rate across that country's rows.

In [25]:
df.groupby(['country'])['suicides/100k pop'].agg(['sum', 'size', 'mean']) 
Out[25]:
sum size mean
country
Albania 924.76 264 3.502879
Antigua and Barbuda 179.14 324 0.552901
Argentina 3894.59 372 10.469328
Armenia 935.65 288 3.248785
Aruba 1596.52 168 9.503095
... ... ... ...
United Arab Emirates 94.89 72 1.317917
United Kingdom 2790.92 372 7.502473
United States 5140.97 372 13.819812
Uruguay 6538.96 336 19.461190
Uzbekistan 2138.17 264 8.099129

100 rows × 3 columns

In [26]:
country = df.loc[:,['country','suicides/100k pop']]
country = country.groupby('country')['suicides/100k pop'].mean().reset_index()
country = country.sort_values('suicides/100k pop', ascending=False)
country = country.head(30)
plt.figure(figsize=(20,15))
plt.title('Global Suicide Rates By Country', fontsize=18)
base_color = sb.color_palette()[0]
chart=sb.barplot(data = country ,x = 'suicides/100k pop',y = 'country', color= base_color)
chart.set(xlabel='Suicides per 100k', ylabel='Country')
sb.despine(left=True, bottom=True);

Lithuania clearly has the highest suicide rate (41 suicides per 100k).

Multivariate Exploration

Here I will investigate the trend over the years for different variables.

Global Suicide Rate Trend Over The Years (By Sex)

In [27]:
g = sb.FacetGrid(data = df, hue = 'sex', height = 7)
g.map(plt.scatter, 'year','suicides/100k pop')
# year stays on a linear axis; a log scale is not meaningful for calendar years
x_ticks = [1985, 1990, 1995, 2000, 2005, 2010, 2015]
g.set(xticks = x_ticks, xticklabels = x_ticks)
plt.title('Global Suicide Rates Trend Over The Years (By Sex)', fontsize=16)
g.set(xlabel='Year', ylabel='Suicides per 100k')
plt.ylim([0,100])
g.add_legend();
  • During the 1980s, the rates for both males and females were low.
  • From the mid-1990s onward, the male rate increased.
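
The scatter plot draws every row, so points overlap heavily and the trend is hard to read. One option, sketched here with made-up numbers, is to average the rate per year and sex first and then plot the resulting table. The frame `toy` and the name `trend` are illustrative only.

```python
import pandas as pd

# Made-up rates: two rows per sex for each of two years.
toy = pd.DataFrame({
    'year': [1985, 1985, 1985, 1985, 1995, 1995, 1995, 1995],
    'sex':  ['male', 'male', 'female', 'female'] * 2,
    'suicides/100k pop': [10.0, 14.0, 3.0, 5.0, 18.0, 22.0, 4.0, 6.0],
})

# Mean rate per (year, sex), reshaped to one row per year and one column per sex.
trend = toy.groupby(['year', 'sex'])['suicides/100k pop'].mean().unstack('sex')
print(trend)
```

Calling `trend.plot()` on a table like this would draw one line per sex, which makes the change over time easier to compare than overlapping points.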

Global Suicide Rates Trend Over The Years (By Age)

In [28]:
g = sb.FacetGrid(data = df, hue = 'age', height = 7,  hue_order=['75+ years', '55-74 years', '35-54 years', '25-34 years', '15-24 years','5-14 years'])
g.map(sb.lineplot, 'year','suicides/100k pop')
# year stays on a linear axis; a log scale is not meaningful for calendar years
x_ticks = [1985, 1990, 1995, 2000, 2005, 2010, 2015]
g.set(xticks = x_ticks, xticklabels = x_ticks)
plt.title('Global Suicide Rates Trend Over The Years (By Age)', fontsize=16)
g.set(xlabel='Year', ylabel='Suicides per 100k')
g.add_legend();
  • The suicide rate for individuals aged 75+ has dropped since 1995.
  • The suicide rate for children aged 5-14 remains consistently low.
In [29]:
!jupyter nbconvert Data_Visualization.ipynb --to slides --post serve --template output_toggle
[NbConvertApp] Converting notebook Data_Visualization.ipynb to slides
[NbConvertApp] Writing 585446 bytes to Data_Visualization.slides.html
[NbConvertApp] Redirecting reveal.js requests to https://cdnjs.cloudflare.com/ajax/libs/reveal.js/3.5.0
Serving your slides at http://127.0.0.1:8000/Data_Visualization.slides.html
Use Control-C to stop this server